24 research outputs found

Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices

Author: Azad Ariful
Buluc Aydin
Ekanayake Saliya
Guidi Giulia
Pavlopoulos Georgios
Selvitopi Oguz
Publication date: 30/09/2020

Identifying similar protein sequences is a core step in many computational biology pipelines such as detection of homologous protein sequences, generation of similarity protein graphs for downstream analysis, functional annotation and gene location. Performance and scalability of protein similarity searches have proven to be a bottleneck in many bioinformatics pipelines due to increases in cheap and abundant sequencing data. This work presents a new distributed-memory software, PASTIS. PASTIS relies on sparse matrix computations for efficient identification of possibly similar proteins. We use distributed sparse matrices for scalability and show that the sparse matrix infrastructure is a great fit for protein similarity searches when coupled with a fully-distributed dictionary of sequences that allows remote sequence requests to be fulfilled. Our algorithm incorporates the unique bias in amino acid sequence substitution in searches without altering the basic sparse matrix model, and in turn, achieves ideal scaling up to millions of protein sequences. Comment: To appear in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'20)
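The abstract does not spell out how sparse matrices identify possibly similar proteins. A common framing, assumed here for illustration and not taken from the abstract, is to build a 0/1 matrix A with one row per sequence and one column per k-mer, so that the nonzero off-diagonal entries of A·Aᵀ count the k-mers shared by each pair of sequences. The sketch below is single-node, pure Python, and uses hypothetical names (`kmers`, `shared_kmer_counts`) and an arbitrary threshold. It does not model the distributed storage or the substitution-aware matching the abstract mentions.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(seqs, k=3):
    """Return the nonzero off-diagonal entries of A * A^T as {(i, j): count}.

    A is the (sequences x k-mers) 0/1 matrix. It is stored column-wise here:
    each k-mer maps to the set of sequence indices that contain it.
    """
    columns = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for km in kmers(seq, k):
            columns[km].add(idx)

    counts = defaultdict(int)
    for rows in columns.values():
        # Every pair of rows with a nonzero in this column shares one k-mer.
        for i, j in combinations(sorted(rows), 2):
            counts[(i, j)] += 1
    return dict(counts)


# Toy input (assumed data, not from the paper).
seqs = ["MKVLAAGIV", "MKVLSAGIV", "WWPPGGHHT"]
overlaps = shared_kmer_counts(seqs, k=3)

# Keep pairs sharing at least two k-mers as alignment candidates
# (the threshold of 2 is an arbitrary choice for this sketch).
candidates = [pair for pair, n in overlaps.items() if n >= 2]
```

Only pairs that share at least one k-mer ever appear in `overlaps`, which is why the sparse formulation avoids comparing every sequence against every other.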

arXiv.org e-Print Archive

eScholarship - University of California

Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

Author: Adam Hughes
Geoffrey Fox
J Ekanayake
Judy Qiu
Mina Rho
Qunfeng Dong
S Bae
Saliya Ekanayake
SB Needleman
Seung-Hee Bae
X Qiu
Y Sun
Y Ye
Yang Ruan
Publication venue: BioMed Central
Publication date: 01/03/2012

Background: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets. Methods: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches do. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. Results: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS. Conclusions: Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
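As a rough illustration of the Needleman-Wunsch step named in the Methods, the sketch below computes the optimal global alignment score of two sequences by dynamic programming. The function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions for this example. The abstract does not state the scoring scheme or how scores are converted into genetic distances, so only the raw score is computed.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table,
    so memory is O(len(b)) while time is O(len(a) * len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Toy sequences (assumed data, not from the paper).
identical = needleman_wunsch_score("GATTACA", "GATTACA")
one_insertion = needleman_wunsch_score("AC", "AGC")
```

Each pair of sequences is scored independently of every other pair, which is what makes the all-pairs computation pleasingly parallel as the abstract notes.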

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

UNT Digital Library

The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: The Royal Society
Publication date: 20/01/2020

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.

arXiv.org e-Print Archive

eScholarship - University of California

Hybrid cloud and cluster computing paradigms for life science applications

Author: Adam Hughes
Bingjing Zhang
C Evangelinos
Chu
E Walker
G Fox
GC Fox
Geoffrey Fox
Hui Li
J Dean
J Ekanayake
J Lange
Jaliya Ekanayake
Jong Youl Choi
Judy Qiu
JW Sammon
Saliya Ekanayake
Seung-Hee Bae
SH Bae
T Gunarathne
Tak-Lon Wu
Thilina Gunarathne
X Qiu
Yang Ruan
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

Phylogenetically Structured Differences in rRNA Gene Sequence Variation among Species of Arbuscular Mycorrhizal Fungi and Their Implications for Sequence Clustering

Author: Bever James D.
Ekanayake Saliya
Fox Geoffrey
House Geoffrey L.
Kaonongbua Wittaya
Ruan Yang
Schutte Ursel M. E.
Ye Yuzhen
Publication venue: American Society for Microbiology
Publication date: 29/07/2016

Arbuscular mycorrhizal (AM) fungi form mutualisms with plant roots that increase plant growth and shape plant communities. Each AM fungal cell contains a large amount of genetic diversity, but it is unclear if this diversity varies across evolutionary lineages. We found that sequence variation in the nuclear large-subunit (LSU) rRNA gene from 29 isolates representing 21 AM fungal species generally assorted into genus- and species-level clades, with the exception of species of the genera Claroideoglomus and Entrophospora. However, there were significant differences in the levels of sequence variation across the phylogeny and between genera, indicating that it is an evolutionarily constrained trait in AM fungi. These consistent patterns of sequence variation across both phylogenetic and taxonomic groups pose challenges to interpreting operational taxonomic units (OTUs) as approximations of species-level groups of AM fungi. We demonstrate that the OTUs produced by five sequence clustering methods using 97% or equivalent sequence similarity thresholds failed to match the expected species of AM fungi, although OTUs from AbundantOTU, CD-HIT-OTU, and CROP corresponded better to species than did OTUs from mothur or UPARSE. This lack of OTU-to-species correspondence resulted both from sequences of one species being split into multiple OTUs and from sequences of multiple species being lumped into the same OTU. The OTU richness therefore will not reliably correspond to the AM fungal species richness in environmental samples. Conservatively, this error can overestimate species richness by 4-fold or underestimate richness by one-half, and the direction of this error will depend on the genera represented in the sample. IMPORTANCE Arbuscular mycorrhizal (AM) fungi form important mutualisms with the roots of most plant species. Individual AM fungi are genetically diverse, but it is unclear whether the level of this diversity differs among evolutionary lineages. 
We found that the amount of sequence variation in an rRNA gene that is commonly used to identify AM fungal species varied significantly between evolutionary groups that correspond to different genera, with the exception of two genera that are genetically indistinguishable from each other. When we clustered groups of similar sequences into operational taxonomic units (OTUs) using five different clustering methods, these patterns of sequence variation caused the number of OTUs to either over- or underestimate the actual number of AM fungal species, depending on the genus. Our results indicate that OTU-based inferences about AM fungal species composition from environmental sequences can be improved if they take these taxonomically structured patterns of sequence variation into account.

KU ScholarWorks

PubMed Central

Towards a systematic study of big data performance and benchmarking

Author: Ekanayake Saliya
Publication venue: Indiana University Press (Project Muse)
Publication date: 01/01/2016

Big data queries are increasing in complexity and the performance of data analytics is of growing importance. To this end, Big Data on high-performance computing (HPC) infrastructure is becoming a pathway to high-performance data analytics. The state of performance studies on this convergence between Big Data and HPC, however, is limited and ad hoc. A systematic performance study is thus timely and forms the core of this research. This thesis investigates the challenges involved in developing Big Data applications with significant computations and strict latency guarantees on multicore HPC clusters. Three key areas it considers are thread models, affinity, and communication mechanisms. Thread models discuss the challenges of exploiting intra-node parallelism on modern multicore chips, while affinity looks at data locality and Non-Uniform Memory Access (NUMA) effects. Communication mechanisms investigate the difficulties of Big Data communications. For example, parallel machine learning depends on collective communications, unlike classic scientific simulations, which mostly use neighbor communications. Minimizing this cost while scaling out to higher parallelisms requires non-trivial optimizations, especially when using high-level languages such as Java or Scala. The investigation also includes a discussion on performance implications of different programming models such as dataflow and message passing used in Big Data analytics. The optimizations identified in this research are incorporated in developing the Scalable Parallel Interoperable Data Analytics Library (SPIDAL) in Java, which includes a collection of multidimensional scaling and clustering algorithms optimized to run on HPC clusters. Besides presenting performance optimizations, this thesis explores a novel scheme for characterizing Big Data benchmarks. Fundamentally, a benchmark evaluates a certain performance-related aspect of a given system. 
For example, HPC benchmarks such as LINPACK and NAS Parallel Benchmark (NPB) evaluate the floating-point operations (flops) per second through a computational workload. The challenge with Big Data workloads is the diversity of their applications, which makes it impossible to classify them along a single dimension. Convergence Diamonds (CDs) is a multifaceted scheme that identifies four dimensions of Big Data workloads. These dimensions are problem architecture, execution, data source and style, and processing view. The performance optimizations together with the richness of CDs provide a systematic guide to developing high-performance Big Data benchmarks, specifically targeting data analytics on large, multicore HPC clusters.

ProQuest OAI Repository